Abstract

Recomi (Repeated correlation matrix inversion) is a polynomially fast algorithm for
searching optimally stable solutions of the perceptron learning problem. For random
unbiased and biased patterns it is shown that the algorithm is able to find optimal
solutions, if any exist, in at worst $O(N^4)$ floating point operations. Even beyond the
critical storage capacity $\alpha_c$ the algorithm is able to find locally stable solutions
(with negative stability) at the same speed. There are no divergent time scales in the
learning process. A full proof of convergence cannot yet be given; only major
constituents of a proof are shown.



Spin glass models of neural networks and their application as an associative memory
have been of great interest in recent years [1-10]. One major issue of the field is the
question of training networks, that is, the construction of a synaptic matrix in order to
store given information. In this paper I present a training algorithm that
is able to find solutions of the perceptron problem of optimal stability in finite time.
Unlike other algorithms, such as Minover presented by Krauth and Mezard [5] or AdaTron
by Anlauf and Biehl [6], this algorithm does not only approximate optimal solutions but
actually finds them. Furthermore, there are no divergent time scales in the solution of the
problem. Minover and AdaTron both have diverging training times as the critical storage
capacity $\alpha_c$ is approached [6,7], whereas this algorithm does not. Therefore it can also
be used beyond $\alpha_c$ in the region of broken replica symmetry, where it finds local optima
of negative stability. A similar algorithm was proposed by Rujan [8], which also finds
optimal perceptrons in finite time, but cannot advance beyond $\alpha_c$.

Like the pseudo-inverse solution of the perceptron problem [9,10], this algorithm uses
inversion of pattern correlation matrices for searching (optimal) perceptron couplings. As
matrix inversion has to be done repeatedly, the algorithm was called Recomi (Repeated
correlation matrix inversion). As was shown by Opper [7], the problem of finding an
optimal perceptron is the problem of finding the subset of embedded training patterns
with minimal local fields. Recomi is able to find this subset of patterns iteratively in
finite time. The coupling vector is then given by the pseudo-inverse solution for this
subset, obtained from the respective pattern correlation matrix.

I consider a network of $N + 1$ neurons $S_i = \pm 1$, $i = 1, \ldots, N + 1$, coupled through
synaptic efficacies $J_{ij}$ (without taking self couplings into account, i.e. $J_{ii} = 0\ \forall i$). The
dynamics of the system is taken to be a simple zero-temperature Monte Carlo process:

$$S_i(t+1) = \operatorname{sgn}\Big(\sum_{j(\neq i)} J_{ij}\, S_j(t)\Big) \tag{1}$$

The purpose of perceptron training algorithms is to find couplings $J_{ij}$ such that $p$
patterns $\eta^\mu = (\eta^\mu_1, \ldots, \eta^\mu_{N+1})^T$, $\eta^\mu_i = \pm 1$, $\mu = 1, \ldots, p$, become fixed points of the
dynamics. That is

$$\eta^\mu_i \sum_{j(\neq i)} J_{ij}\, \eta^\mu_j > 0, \qquad i = 1, \ldots, N+1;\quad \mu = 1, \ldots, p. \tag{2}$$

The problem can be reformulated by looking at the single neurons (or simple perceptrons)
of the network, e.g. neuron $N + 1$. With

$$\xi^\mu_i = \eta^\mu_{N+1}\, \eta^\mu_i, \qquad i = 1, \ldots, N;\quad \mu = 1, \ldots, p \tag{3}$$

one now has to find couplings $J_i$, $i = 1, \ldots, N$, such that

$$h_\mu = \sum_i J_i\, \xi^\mu_i \geq \kappa > 0, \qquad \mu = 1, \ldots, p. \tag{4}$$

If the norm of $J$ is fixed, e.g. $|J| = 1$, it is possible to define what is meant by "optimal
solutions" of the given problem:

$$\text{maximize } \kappa = \min_\mu \{h_\mu\} \text{ under the constraint } |J| = 1 \tag{5}$$

With maximal $\kappa$ one expects to have maximum stability against input noise, i.e.
maximal basins of attraction in a network of neurons.
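
As a minimal sketch of definition (5), the stability of a given coupling vector can be
evaluated directly. The function name `stability` and the toy patterns are illustrative
assumptions; the coupling vector is normalized inside the function, so $|J| = 1$ need not
hold beforehand.

```python
import math

def stability(J, patterns):
    """Return kappa = min_mu (J . xi^mu) / |J| for the coupling vector J."""
    norm = math.sqrt(sum(j * j for j in J))
    fields = [sum(j * x for j, x in zip(J, xi)) / norm for xi in patterns]
    return min(fields)

# Toy data (hypothetical): two patterns xi^mu in N = 2 dimensions.
patterns = [[1, 1], [1, -1]]
kappa_good = stability([1.0, 0.0], patterns)   # both local fields are positive
kappa_bad = stability([0.0, 1.0], patterns)    # second pattern has a negative field
```

A positive return value means that all patterns are stored with positive local field; a
negative value is the "negative stability" discussed for loads beyond the critical capacity.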

From the point of view of mathematical optimization it is suitable to reformulate the
problem. With $J \to J/|\kappa|$ one gets an equivalent formulation of problem (5):

$$\text{minimize } |J| \text{ under the constraints } h_\mu = J^T \xi^\mu \geq +1\ \forall \mu \quad (\text{for } \kappa > 0) \tag{6}$$
$$\text{maximize } |J| \text{ under the constraints } h_\mu = J^T \xi^\mu \geq -1\ \forall \mu \quad (\text{for } \kappa < 0) \tag{7}$$

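I will use this formulation of the problem later in this article. Applying the Kuhn-Tucker
theorem of optimization theory [11] it can be shown [7] (see also [6]) that an
optimal solution, for $\kappa > 0$, can always be written in the form

$$J = \sum_{\mu \in \Gamma} x_\mu\, \xi^\mu \quad \text{where } x_\mu > 0\ \forall \mu \in \Gamma \tag{8}$$

with

$$h_\mu = J^T \xi^\mu \begin{cases} = \kappa & \text{if } \mu \in \Gamma \\ \geq \kappa & \text{else} \end{cases} \tag{9}$$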



For $\kappa < 0$ the same argument holds for all local optima, but with $x_\mu < 0\ \forall \mu \in \Gamma$.
$\Gamma$ is the set of "embedded" patterns, $\Gamma \subset \{1, \ldots, p\}$. The $x_\mu$ are called the embedding
strengths of solution $J$. Anlauf and Biehl have also shown [6] that for $\kappa > 0$ this solution
is unique (which is in general not the case for $\kappa < 0$), i.e. two solutions $J$ and $J^*$ of
the form (8)(9) are always identical, $J = J^*$. Note that if $\{\xi^\mu \mid \mu \in \Gamma\}$ is a set of linearly
independent vectors, e.g. if the patterns are in general position and $\mathrm{card}(\Gamma) \leq N$,
the choice of the $x_\mu$ is unambiguous. On the other hand, if one has a solution of the form
(8)(9), it must be the global optimum of the problem.

In the following sections I am going to describe the Recomi algorithm. Recomi can
solve the stated problem of finding optimal perceptrons of the form (8)(9) in finite time, if
the training patterns are in general position, i.e. if every subset $\{\xi^\mu\}$ with not more than
$N$ elements ($\mathrm{card}(\{\xi^\mu\}) \leq N$) is linearly independent. It does so in not more than $O(N^4)$
floating point operations. There is no divergence of learning times at the critical storage
capacity $\alpha_c = 2$ (for unbiased random patterns), where $\alpha = p/N$. I am going to show
this numerically. In the last section I will deduce some important constituents of a proof
of convergence; unfortunately a full proof cannot yet be given. There I will analyze the
properties of locally stable solutions of the optimization problems (6) and (7). It can be
shown that Recomi always stops in a local optimum. If an optimal solution with $\kappa > 0$
exists, Recomi must stop there. Otherwise it is going to stop in one of the locally stable
solutions with $\kappa < 0$.

Description of the algorithm 

Recomi is an iterative algorithm. It calculates coupling vectors $J^{(t)} = \sum_\mu x^{(t)}_\mu \xi^\mu$ and finds
after a finite number of iterations a solution of the form (8)(9), if it exists. As we will see
later, the algorithm must be initialized with positive embedding strengths $x^{(0)}_\mu > 0$, e.g.
Hebbian couplings $J^{(0)} = \sum_\mu \xi^\mu$. For numerical stability $J^{(t)}$ is normalized to 1 after each
iteration. Let $C_\Gamma$ be the correlation matrix of the patterns in $\Gamma \subset \{1, \ldots, p\}$:

$$C_\Gamma = \big(\xi^{\mu T} \xi^\nu\big)_{\mu,\nu \in \Gamma} \tag{10}$$

Iteration loop 

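Let $J^{(t)}$ be given (from now on I drop the index $t$):

$$J = \sum_\mu x_\mu\, \xi^\mu \qquad (|J| = 1) \tag{11}$$

$$\kappa = \min_\mu \{h_\mu\} = \min_\mu \{J^T \xi^\mu\} \tag{12}$$

Let $\Gamma$ be the subset of patterns with minimal local field $h_\mu$:

$$\Gamma = \{\mu \mid h_\mu = \kappa\} \tag{13}$$

We now want to alter $J$

$$J \to J' = \sum_\mu (x_\mu + \varepsilon\, \Delta x_\mu)\, \xi^\mu \tag{14}$$

so that for all patterns in $\Gamma$ the local fields grow equally

$$h'_\mu = J'^T \xi^\mu = \kappa + \varepsilon \quad \forall \mu \in \Gamma. \tag{15}$$

We therefore choose $\Delta x$ to be the pseudo-inverse [9,10] of the patterns in $\Gamma$:

$$\Delta x_\mu = \begin{cases} \sum_{\nu \in \Gamma} (C_\Gamma^{-1})_{\mu\nu} & \text{if } \mu \in \Gamma \\ 0 & \text{else} \end{cases} \tag{16}$$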

If the training patterns $\xi^\mu$ are in general position, $C_\Gamma$ becomes singular if and only if
the number of patterns in $\Gamma$, $\mathrm{card}(\Gamma)$, is greater than $N$. Then Recomi must stop, with
$J^{(t)}$ being the best solution found. Nevertheless Recomi is able to find optimal solutions,
as I will show in the last section of this paper.

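Now we want to determine the learning rate $\varepsilon$ in a way that all local fields $h'_\mu$ are
greater than or equal to $\kappa + \varepsilon$:

$$h'_\mu = J'^T \xi^\mu \geq \kappa + \varepsilon \quad \forall \mu \in \{1, \ldots, p\}. \tag{17}$$

$\varepsilon = \varepsilon_\mu$ is the value of the learning rate with which we get the equality $h'_\mu = \kappa + \varepsilon$ for
pattern $\mu$:

$$\varepsilon_\mu = \frac{h_\mu - \kappa}{1 - \sum_{\nu \in \Gamma} \Delta x_\nu\, \xi^{\nu T} \xi^\mu}. \tag{18}$$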
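
The form of eqn. (18) can be checked by a short calculation. As a sketch, with the
shorthand $\Delta h_\mu$ (introduced here only for readability, not part of the original
notation) for the change of the local field of pattern $\mu$ per unit learning rate:

```latex
\begin{align*}
h'_\mu &= J'^{T}\xi^\mu = h_\mu + \varepsilon\,\Delta h_\mu ,
\qquad \Delta h_\mu := \sum_{\nu\in\Gamma}\Delta x_\nu\,\xi^{\nu T}\xi^\mu ,\\
h'_\mu = \kappa + \varepsilon
&\;\Longleftrightarrow\;
\varepsilon\,(1-\Delta h_\mu) = h_\mu - \kappa
\;\Longleftrightarrow\;
\varepsilon = \varepsilon_\mu = \frac{h_\mu-\kappa}{1-\Delta h_\mu}.
\end{align*}
```

For a pattern outside $\Gamma$ the numerator $h_\mu - \kappa$ is positive, so $\varepsilon_\mu$ is positive and
finite exactly when $\Delta h_\mu < 1$; patterns with $\Delta h_\mu \geq 1$ never limit the learning rate.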

To fulfill eqn. (17) $\varepsilon$ must be smaller than or equal to all relevant, i.e. all positive, $\varepsilon_\mu$. We
therefore define the set $\Phi$:

$$\Phi = \{\varepsilon_\mu \mid \mu \notin \Gamma \text{ and } 0 < \varepsilon_\mu < \infty\}. \tag{19}$$

If $\Phi$ is not empty we can determine $\varepsilon$ as

$$\varepsilon = \min \Phi. \tag{20}$$

If $\Phi$ is empty, we set $\varepsilon = \infty$, i.e. $J' = \sum_{\mu \in \Gamma} \Delta x_\mu \xi^\mu$, and stop the iteration.

Now $J^{(t+1)} = J'/|J'|$ and we continue at the beginning of the iteration loop. It is
easy to show that always $\kappa^{(t+1)} = (\kappa^{(t)} + \varepsilon)/|J'| > \kappa^{(t)}$ (see Appendix). If no solution
with positive $\kappa$ can be found the algorithm typically stops with $J' = 0$, as will be shown
later. (It should be noted that this is the most sensitive part of the algorithm. Rounding
errors must be controlled when calculating the norm of $J'$.) Then $J^{(t)}$ is taken as the best
solution found by Recomi.
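
The iteration loop described above can be sketched in a few lines of Python. This is an
illustration under stated assumptions, not the original implementation: the names
`recomi_step`, `recomi`, `solve` and `dot` are hypothetical, the tolerance `tol` used to
decide which local fields count as minimal is an assumption, and instead of an explicit
matrix inversion the sketch solves $C_\Gamma \Delta x = (1, \ldots, 1)^T$ by Gauss-Jordan
elimination, which yields the same row sums of $C_\Gamma^{-1}$ as eqn. (16).

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(A, b, tiny=1e-12):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[piv][c]) < tiny:
            raise ValueError("correlation matrix is (nearly) singular")
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * q for a, q in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def recomi_step(J, xi, tol=1e-9):
    """One pass of the loop (nearly optimal variant); J must have |J| = 1.

    Returns (J_next, stopped).
    """
    h = [dot(J, x) for x in xi]                                  # local fields
    kappa = min(h)                                               # eqn. (12)
    gamma = [m for m, hm in enumerate(h) if hm - kappa < tol]    # eqn. (13)
    C = [[dot(xi[m], xi[n]) for n in gamma] for m in gamma]      # eqn. (10)
    dx = solve(C, [1.0] * len(gamma))                            # eqn. (16)
    dJ = [sum(d * xi[m][i] for d, m in zip(dx, gamma)) for i in range(len(J))]
    # Positive, finite learning rates of patterns outside Gamma, eqns. (18), (19).
    phi = []
    for m, hm in enumerate(h):
        denom = 1.0 - dot(dJ, xi[m])
        if m not in gamma and denom > tol:
            phi.append((hm - kappa) / denom)
    if phi:
        eps = min(phi)                                           # eqn. (20)
        J_new = [j + eps * d for j, d in zip(J, dJ)]
    else:
        J_new = dJ                                               # eps = infinity
    norm = math.sqrt(dot(J_new, J_new))
    if norm < tol:                       # J' = 0: keep J as the best solution found
        return J, True
    return [j / norm for j in J_new], not phi

def recomi(xi, max_iter=1000):
    """Iterate from normalized Hebbian couplings until a stopping criterion is met."""
    J = [sum(col) for col in zip(*xi)]
    norm = math.sqrt(dot(J, J))
    J = [j / norm for j in J]
    for _ in range(max_iter):
        J, stopped = recomi_step(J, xi)
        if stopped:
            break
    return J, min(dot(J, x) for x in xi)

# Toy run (hypothetical data): three patterns in N = 3 dimensions.
xi = [[1, 1, 1], [1, 1, -1], [1, -1, 1]]
J, kappa = recomi(xi)
```

In this toy run the Hebbian start has stability $3/\sqrt{11}$, the two patterns with minimal
field form $\Gamma$, and $\Phi$ is empty after the first pass, so the loop stops with $J$ along the
first coordinate axis. A singular $C_\Gamma$ surfaces as a `ValueError` from `solve`.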

Optimal Recomi 

The algorithm I have described so far does not yet find optimal solutions of the form
(8)(9). As the changes of embedding strengths $\Delta x_\mu$ might be negative in eqn. (16), the
$x_\mu$ might also become negative in the end. But already this version of the algorithm does
find nearly optimal solutions for $\kappa > 0$, as can be seen in fig. 1, where I compare results for
unbiased random patterns ($N = 100$) with Gardner's result [3]. Therefore I refer to this
version of Recomi as "nearly optimal Recomi".

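To find optimal solutions of the form (8)(9) it is necessary to start with positive
embedding strengths $x_\mu > 0$, and to make sure that they stay positive throughout the
iteration, i.e. $\Delta x_\mu \geq 0$. This is possible by altering eqn. (16). $\Gamma$ must be replaced by a
subset $\Gamma' \subset \Gamma$ with the following properties:

$$\Gamma' \subset \Gamma \tag{21}$$

$$\Delta x_\mu = \sum_{\nu \in \Gamma'} (C_{\Gamma'}^{-1})_{\mu\nu} \geq 0 \quad \forall \mu \in \Gamma' \tag{22}$$

$$\Delta h_\mu = \Big(\sum_{\nu \in \Gamma'} \Delta x_\nu\, \xi^\nu\Big)^T \xi^\mu \geq 1 \quad \forall \mu \in \Gamma \tag{23}$$

It is always possible to find such a subset $\Gamma'$ (as long as $C_\Gamma$ itself is regular),
because $\sum_{\nu \in \Gamma'} \Delta x_\nu \xi^\nu$ then is the (unique) optimal perceptron for the correct mapping of the
patterns $\mu \in \Gamma$.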

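$\Gamma'$ can easily be determined. The following algorithm proved to work in all cases tested
(about $O(10^5)$ algorithm runs). I cannot yet prove its convergence analytically. This has
to be done in later work. To find $\Gamma'$ one can proceed as follows:

1.) start with $\Gamma' = \Gamma$

2.) calculate $\Delta x_\mu = \sum_{\nu \in \Gamma'} (C_{\Gamma'}^{-1})_{\mu\nu}$ ($\mu \in \Gamma'$); $\Delta x_\rho = \min_\mu \{\Delta x_\mu\}$; if $\Delta x_\rho < 0$
remove $\rho$ from $\Gamma'$ and go to 2.) else go to 3.)

3.) calculate $\Delta h_\mu = \big(\sum_{\nu \in \Gamma'} \Delta x_\nu \xi^\nu\big)^T \xi^\mu$ ($\mu \in \Gamma \setminus \Gamma'$); $\Delta h_\sigma = \min_\mu \{\Delta h_\mu\}$; if
$\Delta h_\sigma < 1$ add $\sigma$ to $\Gamma'$ and go to 2.) else STOP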
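
Steps 1.) to 3.) can be sketched as follows. The function names, the tolerance `tol` and
the guard `max_steps` are assumptions made for this illustration (the text gives no proof
that the procedure always terminates, hence the guard), and the toy vectors are
real-valued rather than binary so that a negative $\Delta x$ actually occurs.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(A, b, tiny=1e-12):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[piv][c]) < tiny:
            raise ValueError("correlation matrix is (nearly) singular")
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * q for a, q in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def select_subset(gamma, xi, tol=1e-9, max_steps=1000):
    """Return (gamma_prime, dx) following steps 1.) to 3.)."""
    sub = list(gamma)                                        # step 1.)
    for _ in range(max_steps):
        C = [[dot(xi[m], xi[n]) for n in sub] for m in sub]
        dx = solve(C, [1.0] * len(sub))                      # step 2.)
        k = min(range(len(sub)), key=lambda i: dx[i])
        if dx[k] < -tol:
            del sub[k]                                       # drop the most negative dx
            continue
        dJ = [sum(d * xi[m][i] for d, m in zip(dx, sub))
              for i in range(len(xi[0]))]
        rest = [m for m in gamma if m not in sub]            # step 3.)
        if rest:
            s = min(rest, key=lambda m: dot(dJ, xi[m]))
            if dot(dJ, xi[s]) < 1.0 - tol:
                sub.append(s)                                # field grows too slowly
                continue
        return sub, dx
    raise RuntimeError("subset search did not settle within max_steps")

# Toy data (hypothetical, real-valued): the second vector gets a negative dx.
xi = [[1.0, 0.0], [2.0, 1.0]]
subset, dx = select_subset([0, 1], xi)
```

Here the full set gives $\Delta x = (3, -1)$, so the second vector is removed; the remaining
single vector raises the field of the removed one by 2 per unit learning rate, which
satisfies the condition of step 3.), and the search stops with one embedded pattern.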

By replacing $\Gamma$ by $\Gamma'$ in eqn. (16) Recomi is able to find optimal solutions. I refer to
this improved version of the algorithm as "optimal Recomi". In fig. 2 I check for unbiased
random binary patterns ($N = 100$) how often the algorithm stops in optimal solutions
with $\kappa > 0$, and in locally optimal solutions with $\kappa < 0$. For every value of $\alpha = p/N$
100 different pattern sets are tested. In very rare cases (not in this figure) the algorithm
only gets close to but does not reach optimal solutions: trying to invert nearly singular
correlation matrices can cause failure of the inversion subroutines.

In fig. 1 I compare results for unbiased and biased random binary patterns with Gardner's
result [3]. The patterns $\eta^\mu_i$ are chosen with a probability distribution
$p(\eta^\mu_i) = \frac{1-m}{2}\, \delta(\eta^\mu_i + 1) + \frac{1+m}{2}\, \delta(\eta^\mu_i - 1)$, using $m = 0$ (unbiased) and
$m = 0.8$ (biased), and the $\xi^\mu_i$ calculated according to eqn. (3). Within the error bounds
there is no difference to be seen between optimal and nearly optimal solutions below
$\alpha_c$ ($\kappa > 0$). In the range of replica symmetry breaking $\alpha > \alpha_c$ ($\kappa < 0$) optimal Recomi
clearly performs better than the simpler version of the algorithm. Here it cannot be
expected that the algorithm finds a global stability optimum, as it gets trapped in one of
the many local optima, which will be shown in the last section of this paper. Note that
for the biased patterns ($m = 0.8$) at $N = 100$ one still has to take finite size effects into
account: the measured points are all optimal solutions, but still lie slightly below the
Gardner curve. Also note that the theoretical lines are all calculated in replica symmetric
approximation, i.e. they must be corrected for negative $\kappa$, where replica symmetry is no
longer valid.
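
The pattern statistics used for fig. 1 can be sampled as in the following sketch. The
function names, the seed and the sizes are illustrative assumptions; only the
distribution with bias $m$ and the transformation of eqn. (3) are taken from the text.

```python
import random

def sample_patterns(p, n, m, rng):
    """Draw p patterns eta^mu in {-1, +1}^(n+1); each entry is +1 with probability (1+m)/2."""
    prob_plus = (1.0 + m) / 2.0
    return [[1 if rng.random() < prob_plus else -1 for _ in range(n + 1)]
            for _ in range(p)]

def to_perceptron_inputs(eta):
    """Eqn. (3): xi_i^mu = eta_{N+1}^mu * eta_i^mu for i = 1..N."""
    return [[row[-1] * v for v in row[:-1]] for row in eta]

rng = random.Random(0)            # fixed seed, chosen here only for reproducibility
eta = sample_patterns(p=200, n=100, m=0.8, rng=rng)   # alpha = p/N = 2
xi = to_perceptron_inputs(eta)
```

Each row of `xi` is one training pattern $\xi^\mu$ of length $N$ for the perceptron formed by
neuron $N + 1$.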

In fig. 3 I train perceptrons of different sizes $N$ with unbiased random binary patterns.
Convergence time is plotted against system size $N$ for different values of the storage
capacity $\alpha$. The most expensive part of the algorithm, in the large $N$ limit, is matrix
inversion, which is of $O(N^3)$ for each single inversion. Nearly optimal Recomi therefore
is, in the worst case, of $O(\sum_{i=1}^{N} i^3) = O(N^4)$, as $\mathrm{card}(\Gamma)$ grows at least by one in each
iteration step. For optimal Recomi one cannot give such a simple derivation of convergence
times, as $\mathrm{card}(\Gamma)$ can also shrink in the learning process. But here, too, convergence time
is found numerically to be bounded from above by $O(N^4)$: In fig. 3 I count the number of
floating point operations ($+$, $-$, $*$, $/$) optimal Recomi needs to find solutions. As below
$N = 100$ convergence time is still dominated by other operations apart from matrix inversion,
I only plot the matrix inversion part here. All other operations are of $O(N^3)$ or below.
Just as predicted for nearly optimal Recomi, the optimal version of the algorithm converges
in $O(N^4)$ or less floating point operations.
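
The worst-case estimate for nearly optimal Recomi can be made explicit. As a sketch,
assume that inverting an $i \times i$ matrix costs $c\, i^3$ operations with some constant $c$
(an assumption about the inversion routine) and that $\mathrm{card}(\Gamma)$ grows by exactly one
per iteration until it reaches $N$:

```latex
\sum_{i=1}^{N} c\, i^{3}
  = c \left( \frac{N(N+1)}{2} \right)^{2}
  = \frac{c}{4}\, N^{4} + O(N^{3}) .
```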

In fig. 4 I plot convergence time (i.e. number of floating point operations) against the
storage capacity $\alpha$. Again the perceptron ($N = 100$) was trained with unbiased random
binary patterns. There is no divergence at $\alpha = \alpha_c = 2$. For small $\alpha$ the two versions of
the algorithm differ only little, as nearly optimal Recomi also often finds optimal solutions
(see also fig. 1). For larger values of $\alpha$ the convergence times evolve differently.

Analysis of local stability optima: towards a proof of 
convergence 

I cannot yet give a full proof of convergence of Recomi, but some major components can
already be deduced. For this reason I want to consider the role of local stability optima.
It is useful here to use the problem formulations eqn. (6) (for $\kappa > 0$) and eqn. (7) (for
$\kappa < 0$). If I write a $\pm$ sign in the following text, the $+$ always refers to the case $\kappa > 0$
and the $-$ to $\kappa < 0$.



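The problem (6)(7) can now be formulated as

$$\text{minimize } f(x) = \pm J^T J = \pm x^T C x$$
$$\text{under the constraints } h_\mu = J^T \xi^\mu = (C x)_\mu \geq \pm 1, \quad \mu = 1, \ldots, p, \tag{24}$$

where $C$ is the correlation matrix of all $p$ patterns, $C_{\mu\nu} = \xi^{\mu T} \xi^\nu$.
$\Gamma$ is the set of patterns with minimal local field:

$$\Gamma = \{\mu \mid h_\mu = (C x)_\mu = \pm 1\}. \tag{25}$$

Let $\Omega$ be the set of all possible search directions $\Delta x$ which do not violate the inequality
constraints eqn. (24):

$$\Omega = \{\Delta x \in \mathbb{R}^p \mid (C \Delta x)_\mu \geq 0\ \forall \mu \in \Gamma\}. \tag{26}$$

A solution $J$ is locally optimal if and only if

$$\nabla_x f(x)^T \Delta x = \pm 2\, x^T C \Delta x \geq 0 \quad \forall \Delta x \in \Omega. \tag{27}$$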

I now prove the important theorem that, if there is a solution with positive stability
$\kappa > 0$, there cannot be locally stable solutions $J$ with negative stability $\kappa < 0$ and
$x_\mu \geq 0\ \forall \mu$, $\sum_\mu x_\mu > 0$:

If there is a solution with $\kappa > 0$ there must be a solution of the form (e.g. the optimal
perceptron)

$$J^* = \sum_\mu x^*_\mu \xi^\mu, \quad \text{with } J^{*T} \xi^\mu = (C x^*)_\mu \geq \kappa^* > 0\ \forall \mu. \tag{28}$$

Let us assume $J = \sum_\mu x_\mu \xi^\mu$ is locally optimal with $\kappa = \min_\mu \{J^T \xi^\mu\} < 0$ and $x_\mu \geq 0\ \forall \mu$,
$\sum_\mu x_\mu > 0$. That means (eqn. (27)):

$$x^T C \Delta x \leq 0 \quad \forall \Delta x \in \Omega. \tag{29}$$

As $(C x^*)_\mu \geq \kappa^* > 0\ \forall \mu$, we have:

$$\Delta x = x^* \in \Omega \tag{30}$$

$$x^T C \Delta x = x^T C x^* = \sum_\mu x_\mu (C x^*)_\mu \geq \kappa^* \sum_\mu x_\mu > 0 \tag{31}$$

in contradiction to eqn. (29). Therefore such a vector $J$ cannot exist. We will see below
that optimal Recomi always stops in (local) optima which by definition of the algorithm
are of the form $x_\mu \geq 0\ \forall \mu$ and $\sum_\mu x_\mu > 0$. So if there is any solution with $\kappa > 0$ Recomi
can only stop in the global optimum of the problem, because then there are no other
optima of that form.



To show this, I have to make several assumptions, which I cannot prove yet: a) The
algorithm described in section "Optimal Recomi" for deriving $\Gamma'$ really always works. b)
The size of $\Gamma$, $\mathrm{card}(\Gamma)$, grows by not more than one in each iteration step, especially not
from $\mathrm{card}(\Gamma) < N$ to $\mathrm{card}(\Gamma) > N$. c) Recomi really terminates in finite time. About this
last point one can only say that $\kappa^{(t)}$ is a strictly monotonic function of $t$ (see Appendix),
i.e. there is always an attractor of the training dynamics.

If these three assumptions are correct, Recomi stops in a (local) optimum, which is
the global one, if solutions $\kappa > 0$ exist. To show this I have to consider the three possible
ways the algorithm can stop: 1) $\Phi$ is empty, i.e. $\varepsilon$ becomes infinite. 2) $J'$ is zero. 3) $C_\Gamma$
is singular.

1) $\Phi$ is empty: This is the simplest case. Then, by definition, $J' = \sum_{\mu \in \Gamma'} \Delta x_\mu \xi^\mu$,
which is an optimal solution of the form (8)(9). This is the usual way Recomi stops if
solutions $\kappa > 0$ exist.

2) $J'$ is zero: Then $J^{(t)} = -\varepsilon \sum_{\mu \in \Gamma'} \Delta x_\mu \xi^\mu$. Applying the Kuhn-Tucker theorem this
is a locally stable solution for $\kappa < 0$ (just like (8)(9) for $\kappa > 0$). As $J^{(t)}$ is coded in the
form $x_\mu \geq 0\ \forall \mu$ and $\sum_\mu x_\mu > 0$ there cannot be solutions with $\kappa > 0$, as was shown above.
This is the usual way Recomi stops if no solutions $\kappa > 0$ exist.

3) $C_\Gamma$ is singular: Then $\mathrm{card}(\Gamma) > N$ (because the training patterns are in general
position). According to our assumption, $\mathrm{card}(\Gamma)$ must have been $N$ in the iteration step
before. $\Gamma'$ must have been equal to $\Gamma$ because otherwise $\mathrm{card}(\Gamma)$ would not have grown.
As $\{\xi^\mu \mid \mu \in \Gamma\}$ does span $\mathbb{R}^N$, $J^{(t-1)}$ is completely determined by the local fields
$J^{(t-1)T} \xi^\mu$, $\mu \in \Gamma$, i.e. $J^{(t-1)} \sim \sum_{\mu \in \Gamma} \Delta x_\mu \xi^\mu$, which is a local optimum. Therefore case 3)
does in principle never occur; the algorithm stops before in 1) or 2).

In practice case 3) does occur, as sometimes nearly singular correlation matrices cannot
be inverted by the inversion subroutines because of numerical restrictions.

Conclusion 

In this article I presented a perceptron learning algorithm which is able to find the
optimal perceptron in finite time, i.e. in $O(N^4)$ floating point operations. The algorithm
even works beyond the critical storage capacity $\alpha_c$, where it finds solutions of negative
stability that are locally optimal. Calculating the stability curve $\kappa(\alpha)$ for random training
patterns reproduces Gardner's predictions [3] within the error bounds. A full proof of
convergence could not yet be given, but major constituents were already shown. As the
algorithm works very reliably, it can be expected that a full proof of convergence can be
found. Furthermore it is planned to generalize the algorithm to two layer perceptrons with
fixed output. First results are very promising, yet it cannot be expected that the algorithm
finds globally optimal solutions, because replica symmetry breaking effects are very strong
in this case.

Appendix 

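In this appendix I will show that $\kappa^{(t)}$ is a strictly monotonic function of $t$:

$$\kappa^{(t+1)} = \frac{\kappa^{(t)} + \varepsilon}{|J'|} \tag{32}$$

$$\varrho = \Big(\sum_{\mu \in \Gamma} \Delta x_\mu \xi^\mu\Big)^2 = \sum_{\mu,\nu \in \Gamma} \Delta x_\mu (C_\Gamma)_{\mu\nu} \Delta x_\nu = \sum_{\mu,\nu,\lambda \in \Gamma} \Delta x_\mu (C_\Gamma)_{\mu\nu} (C_\Gamma^{-1})_{\nu\lambda} = \sum_{\mu \in \Gamma} \Delta x_\mu > 0 \tag{33}$$

$$J'^T J' = \Big(J^{(t)} + \varepsilon \sum_{\mu \in \Gamma} \Delta x_\mu \xi^\mu\Big)^2 = 1 + 2 \varepsilon \kappa^{(t)} \varrho + \varepsilon^2 \varrho \geq 0 \quad \forall \varepsilon \in \mathbb{R} \tag{34}$$

$$\text{e.g. } \varepsilon = -\kappa^{(t)} \;\Rightarrow\; 1 - \kappa^{(t)2} \varrho \geq 0 \tag{35}$$

$$\frac{d}{d\varepsilon} \kappa^{(t+1)} = \big(1 + 2 \varepsilon \kappa^{(t)} \varrho + \varepsilon^2 \varrho\big)^{-3/2} \big(1 - \kappa^{(t)2} \varrho\big) \geq 0 \tag{36}$$

$\frac{d}{d\varepsilon} \kappa^{(t+1)} = 0$ if and only if Recomi stops in a (local) optimum:

$$\frac{d}{d\varepsilon} \kappa^{(t+1)} = 0 \iff 1 - \kappa^{(t)2} \varrho = 0 \iff J^{(t)} = \kappa^{(t)} \sum_{\mu \in \Gamma} \Delta x_\mu \xi^\mu \tag{37}$$

That means $\kappa^{(t+1)} > \kappa^{(t)}$ as long as Recomi has not terminated, qed.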
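
The step from eqns. (32) and (34) to eqn. (36) can be written out as follows. This is a
sketch in which $\kappa$ stands for $\kappa^{(t)}$, $\varrho$ denotes the quantity of eqn. (33), and the
abbreviation $D$ for the squared norm of eqn. (34) is introduced here only for brevity:

```latex
\begin{align*}
D &:= 1 + 2\varepsilon\kappa\varrho + \varepsilon^{2}\varrho ,
\qquad \tfrac{1}{2}\,\frac{dD}{d\varepsilon} = \varrho\,(\kappa+\varepsilon) ,\\
\frac{d}{d\varepsilon}\,\frac{\kappa+\varepsilon}{\sqrt{D}}
&= \frac{D - \varrho\,(\kappa+\varepsilon)^{2}}{D^{3/2}}
 = \frac{1 - \kappa^{2}\varrho}{D^{3/2}} ,
\end{align*}
```

since $D - \varrho(\kappa + \varepsilon)^2 = 1 + 2\varepsilon\kappa\varrho + \varepsilon^2\varrho - \varrho\kappa^2 - 2\varepsilon\kappa\varrho - \varepsilon^2\varrho = 1 - \kappa^2\varrho$.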

Figure 1: Comparison of Recomi with Gardner's result, $N = 100$, 100 sets of unbiased
($m = 0$) and biased ($m = 0.8$) random binary patterns for each measurement

Figure 2: Optimal Recomi, $N = 100$, 100 sets of unbiased random binary patterns for
each value of $\alpha$

Figure 3: Convergence time against system size for optimal Recomi (unbiased random
binary patterns)

Figure 4: Convergence time against storage capacity $\alpha$ (unbiased random binary patterns,
$N = 100$). There is no divergence at $\alpha_c = 2$.